Abstract

A classic tension exists between exact inference in a simple model
and approximate inference in a complex model. The latter offers
expressivity and thus accuracy, but the former provides coverage of
the space, an important property for confidence estimation and
learning with indirect supervision. In this work, we introduce a new
approach, reified context models, to reconcile this tension.
Specifically, we let the amount of context (the arity of the factors
in a graphical model) be chosen “at run-time” by reifying it, that
is, by letting this choice itself be a random variable inside the
model. Empirically, we show that our approach obtains expressivity
and coverage on three natural language tasks.

1. Introduction 

Many structured prediction tasks across natural language processing,
computer vision, and computational biology can be formulated as that
of learning a distribution over outputs
$y_{1:L} = (y_1, \ldots, y_L) \in \mathcal{Y}_{1:L}$ given an input $x$:

$$p_\theta(y_{1:L} \mid x) \propto \exp\left( \sum_{i=1}^{L} \theta^\top \phi_i(y_{1:i-1}, y_i, x) \right). \tag{1}$$

The thirst for expressive models (e.g., where $y_i$ depends heavily
on its context $y_{1:i-1}$) often leads one down the route of
approximate inference, for example, to Markov chain Monte Carlo
(Brooks et al., 2011), sequential Monte Carlo (Cappe et al., 2007;
Doucet & Johansen, 2011), or beam search (Koehn et al., 2003). While
these methods in principle can operate on models with arbitrary
amounts of context, they only touch a small portion of the output
space $\mathcal{Y}_{1:L}$. Without such coverage, we miss out on two
important but oft-neglected properties:

• precision: In user-facing applications, it is important to only
predict on inputs where the system is confident, leaving hard
decisions to the user (Zhang et al., 2014). Lack of coverage means
failing to consider all alternative outputs, which leads to
overconfidence and poor estimates of uncertainty.

• indirect supervision: When only part of the output $y_{1:L}$ is
observed, lack of coverage is even more problematic than it is in the
fully-supervised setting. An approximate inference algorithm might
not even consider the true $y$ (whereas one always has the true $y$
in a fully-supervised setting), which leads to invalid parameter
updates (Yu et al., 2013).

Of course, lower-order models admit exact inference and ensure
coverage, but these models have unacceptably low expressive power.
Ideally, we would like a model that varies the amount of context in a
judicious way, allocating modeling power to parts of the input that
demand it. Therein lies the principal challenge: How can we
adaptively choose the amount of context for each position $i$ in a
data-dependent way while maintaining tractability?

In this paper, we introduce a new approach, which we call reified
context models. The key idea is based on reification, a general idea
in logic and programming languages, which refers to making something
previously inaccessible (e.g., functions or metadata of functions) a
“first-class citizen” and therefore available (e.g., via lambda
abstraction or reflection) to formal manipulation. In the
probabilistic modeling setting, we propose reifying the contexts as
random variables in the model so that we can reason over them.

Figure 1. Illustration of our method for handwriting recognition.
At each position, we keep track of a collection of contexts, and
learn a model that factorizes with respect to these contexts. Each
context remembers a certain amount of history, e.g. $\star o$ is all
length two sequences whose second character is o. By using contexts
at multiple levels of resolution, we can obtain coverage of the
entire space while still modeling complex dependencies.

Specifically, for each $i \in \{1, \ldots, L-1\}$, we maintain a
collection $\mathcal{C}_i$ of contexts, each of which is a subset of
$\mathcal{Y}_{1:i}$ representing what we remember about the past (see
Figure 1 for an example). We define a joint model over $(y, c)$,
suppressing $x$ for brevity:

$$p_\theta(y_{1:L}, c_{1:L-1}) \propto \exp\left( \sum_{i=1}^{L} \theta^\top \phi_i(c_{i-1}, y_i) \right) \times \kappa(y, c), \tag{2}$$

where $\kappa$ is a consistency potential, to be explained later.
The features $\phi_i$ now depend on the current context $c_{i-1}$,
rather than the full history $y_{1:i-1}$. The distribution over
$(y, c)$ factorizes according to the graphical model below:

[graphical model over $y_{1:L}$ and $c_{1:L-1}$; figure not reproduced]  (3)


The factorization (3) implies that the family in (2) admits efficient
exact inference via the forward-backward algorithm as long as each
collection $\mathcal{C}_i$ has small cardinality.

Adaptive selection of context. Given limited computational
resources, we only want to track contexts that are reasonably likely
to contain the answer. We do this by selecting the context sets
$\mathcal{C}_i$ during a forward pass using a heuristic similar to
beam search, but unlike beam search, we achieve coverage because we
are selecting contexts rather than individual variable values. We
detail our selection method, called RCMS, in Section 4. We can think
of selecting the $\mathcal{C}_i$'s as selecting a model to perform
inference in, with the guarantee that all such models will be
tractable. Our method is simple to implement; see the appendix for
implementation details.

The goal of this paper is to flesh out the framework described above,
providing intuitions about its use and exploring its properties
empirically. To this end, we start in Section 2 by defining some
tasks that motivate this work. In Sections 3 and 4 we introduce
reified context models formally, together with an algorithm, RCMS,
for selecting contexts adaptively at run-time. Sections 5-7 explore
the empirical properties of the RCMS method. Finally, we discuss
future research directions in Section 9.

2. Description of Tasks 

To better understand the motivation for our work, we present three
tasks of interest, which are also the tasks used in our empirical
evaluation later. These tasks are word recognition (a fully
supervised task), speech recognition (a weakly supervised task), and
decipherment (an unsupervised task). The first of these tasks is
relatively easy while the latter two are harder. We use word
recognition to study the precision of our method, the other two tasks
to explore learning under indirect supervision, and all three to
understand how our algorithm selects contexts during training.

Word recognition. The first task is the word recognition task from
Kassel (1995); we use the “clean” version of the dataset as in Weiss
et al. (2012). This contains 6,876 examples, split into 10 folds
(numbered 0 to 9); we used fold 1 for testing and the rest for
training. Each input is a sequence of 16 × 8 binary images of
characters; the output is the word that those characters spell. The
first character is omitted due to capitalization issues. Since this
task ended up being too easy as given, we downsampled each image to
be 8 × 4 (by taking all pixels whose coordinates were both odd). An
example input and output is given below:

input x: [sequence of character images; not reproduced]
output y: projections

Each individual image is too noisy to interpret in isolation, 
and so leveraging the context of the surrounding characters 
is crucial to achieving high accuracy. 

Speech recognition. Our second task is from the Switchboard speech
transcription project (Greenberg et al., 1996). The dataset consists
of 999 utterances, split into two chunks of sizes 746 and 253; we
used the latter chunk as a test set. Each utterance is a phonetic
input and textual output:

input x: h# y ae ax s w ih r del d h#
latent z: (alignment)
output y: yeah_it’s_weird

The alignment between the input and output is unobserved. 

The average input length is 26 phonemes, or 2.5 seconds 
of speech. We removed most punctuation from the output, 
except for spaces, apostrophes, dashes, and dots. 

Decipherment. Our final task is a decipherment task similar to that
described in Nuhn & Ney (2014). In decipherment, one is given a large
amount of plain text and a smaller amount of cipher text; the latter
is drawn from the same distribution as the former but is then passed
through a 1-to-1 substitution cipher. For instance, the plain text
sentence “I am what I am” might be enciphered as “13 5 54 13 5”:

latent z: I am what I am
output y: 13 5 54 13 5

The task is to reverse the substitution cipher, e.g. determine that
13 ↦ I, 5 ↦ am, etc.

We extracted a dataset from the English Gigaword corpus (Graff &
Cieri, 2003) by finding the 500 most common words and filtering for
sentences that only contained those words. This left us with 24,666
utterances, of which 2,000 were enciphered and the rest were left as
plain text.

Note that this task is unsupervised, but we can hope to gain
information about the cipher by looking at various statistics of the
plaintext and ciphertext. For instance, a very basic idea would be to
match words based on their frequency. This alone doesn’t work very
well, but by considering bigram and trigram statistics we can do much
better.
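
As an illustration of the frequency-matching baseline mentioned above, the following minimal sketch pairs each cipher symbol with the plaintext word of the same frequency rank. The function name `frequency_match` and the toy token lists are assumptions made for this example; they are not taken from the paper's implementation.

```python
from collections import Counter


def frequency_match(plain_tokens, cipher_tokens):
    """Map each cipher symbol to the plaintext word with the same frequency rank.

    Hypothetical helper for illustration: unigram rank matching only.
    """
    plain_ranked = [w for w, _ in Counter(plain_tokens).most_common()]
    cipher_ranked = [s for s, _ in Counter(cipher_tokens).most_common()]
    # zip truncates to the shorter list, so unmatched items are simply dropped
    return dict(zip(cipher_ranked, plain_ranked))


# Toy data with distinct counts so that the ranking is unambiguous
plain = ["I", "am", "what", "I", "am", "I"]
cipher = ["13", "5", "54", "13", "5", "13"]
mapping = frequency_match(plain, cipher)
```

On real data many words have similar counts, so rank matching is ambiguous; this is consistent with the remark that unigram frequency alone does not work very well.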

3. Reified Context Models 

We now formally introduce reified context models. Our setup is
structured prediction, where we predict an output
$(y_1, \ldots, y_L) \in \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_L$;
we abbreviate these as $y_{1:L}$ and $\mathcal{Y}_{1:L}$. While this
setup assumes a generative model, we can easily handle discriminative
models as well as a variable length $L$; we ignore these extensions
for simplicity.

Our framework reifies the idea of context as a tool for efficient
inference. Informally, a context for $y_i$ is information that we
remember about $y_{1:i-1}$. In our case, a context $c_{i-1}$ is a
subset of $\mathcal{Y}_{1:i-1}$, which should contain $y_{1:i-1}$; in
other words, $c_{i-1}$ is “remembering” that $y_{1:i-1} \in c_{i-1}$.
A context set $\mathcal{C}_{i-1}$ is a collection of possible values
for $c_{i-1}$.

Formally, we define a canonical context set $\mathcal{C}_i$ to be a
collection of subsets of $\mathcal{Y}_{1:i}$ satisfying two
properties:^1

• coverage: $\mathcal{Y}_{1:i} \in \mathcal{C}_i$

• closure: for $c, c' \in \mathcal{C}_i$, $c \cap c' \in \mathcal{C}_i \cup \{\emptyset\}$

An example of such a collection is given in Figure 2; as in
Section 1, notation such as $\star\star a$ denotes the subset of
$\mathcal{Y}_{1:3}$ where $y_3 = a$.

We refer to each element of $\mathcal{C}_i$ as a “context”. Given a
sequence $y_{1:L}$, we need to define contexts $c_{1:L-1}$ such that
$y_{1:i} \in c_i$ for all $i$. The coverage property ensures that some
such context always exists: we can take $c_i = \mathcal{Y}_{1:i}$.

In reality, we would like to use the smallest (most precise) context
$c_i$ possible; the closure property ensures that this is canonically
defined: given a context $c_{i-1} \in \mathcal{C}_{i-1}$ and a value
$y_i \in \mathcal{Y}_i$, we inductively define
$c_i = f_i(c_{i-1}, y_i)$ to be the intersection of all
$c \in \mathcal{C}_i$ that contain $c_{i-1} \times \{y_i\}$, or
equivalently the smallest such $c$. Example evaluations of $f$ are
provided in Figure 2. Note that $y_{1:i} \in c_i$ always.
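
The definition of the canonical context can be made concrete with a small sketch. It assumes a representation that the paper does not prescribe: each context is a string over the alphabet plus a wildcard `*`, one character per position. The helper names `contains` and `canonical_context` and the particular context set `C3` are assumptions chosen so that the four example evaluations listed with Figure 2 come out as stated.

```python
def contains(p, q):
    """True if pattern p (as a set of sequences) contains pattern q.

    Assumes equal-length patterns and an alphabet with more than one symbol,
    so a fixed symbol in p never covers a wildcard in q.
    """
    return all(a == "*" or a == b for a, b in zip(p, q))


def canonical_context(context_set, c_prev, y):
    """Smallest context in context_set that contains c_prev x {y}."""
    target = c_prev + y
    candidates = [c for c in context_set if contains(c, target)]
    # Coverage guarantees at least the all-wildcard context is a candidate;
    # closure makes the most specific candidate unique.
    return max(candidates, key=lambda c: sum(ch != "*" for ch in c))


# Hypothetical context set for position 3 (satisfies coverage and closure)
C3 = {"***", "**a", "*ca", "*ba", "*bc", "abc"}
```

With this set, `canonical_context(C3, "ab", "b")` falls back to the catch-all `***`, which is how coverage is maintained for histories that no precise context tracks.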

We now define a joint model over the variables $y_{1:L}$ and

^1 This is similar to the definition of a hierarchical decomposition
from Steinhardt & Liang (2014). Our closure condition replaces the
more restrictive condition that $c \cap c' \in \{c, c', \emptyset\}$.


Figure 2. Illustration of a context set $\mathcal{C}_3$. These sets
form a hierarchy, allowing us to focus on certain specific values in
$\mathcal{Y}_{1:3}$, while also allocating some resources (via the
context $\star\star\star$) to model the rest of $\mathcal{Y}_{1:3}$
as well. To the right are some example outputs of $f_3$.

[Figure: tree of contexts, including $\star ca$, $\star\star a$, $abc$, and $\star bc$.]

$f_3(\star b, a) = \star ba$
$f_3(\star a, a) = \star\star a$
$f_3(bb, c) = \star bc$
$f_3(ab, b) = \star\star\star$


contexts $c_{1:L-1}$:

$$p_\theta(y_{1:L}, c_{1:L-1}) \propto \exp\left( \sum_{i=1}^{L} \theta^\top \phi_i(c_{i-1}, y_i) \right) \times \kappa(y, c),$$

where $\kappa(y, c) = \prod_{i=1}^{L-1} \mathbb{I}[c_i = f_i(c_{i-1}, y_i)]$
enforces consistency of contexts. The distribution $p_\theta$ factors
according to (3). One consequence of this is that the variables
$y_{1:L}$ are jointly independent given $c_{1:L-1}$: the contexts
contain all information about interrelationships between the $y_i$.

Mathematically, the model above is similar to a hidden Markov model
where $c_i$ is the hidden state. However, we choose the context sets
adaptively, giving us much greater expressive power than an HMM,
since we essentially have exponentially many choices of hidden states
(canonical context sets) to select from at runtime.

Example: 2nd-order Markov chain. To provide more intuition, we
construct a 2nd-order Markov chain using our framework (we can
construct nth-order Markov chains in the same way). We would like
$c_i$ to “remember” the previous 2 values, i.e. $(y_{i-1}, y_i)$. To
do this, we let $\mathcal{C}_i$ consist of all sets of the form
$\mathcal{Y}_{1:i-2} \times \{(y_{i-1}, y_i)\}$; these sets fix the
value of $(y_{i-1}, y_i)$ while allowing $y_{1:i-2}$ to vary freely.
Then $f_{i+1}(c_i, y_{i+1}) = \mathcal{Y}_{1:i-1} \times \{(y_i, y_{i+1})\}$,
which is well-defined since $y_i$ can be determined from $c_i$.

If $|\mathcal{Y}_i| = V$, then $|\mathcal{C}_i| = V^2$ (or $V^n$ for
nth-order chains), reflecting the cost of inference in such models.

As a technical note, we also need to include $\mathcal{Y}_{1:i}$ in
$\mathcal{C}_i$ to satisfy the coverage condition. However,
$\mathcal{Y}_{1:i}$ will never actually appear as a context, as can be
seen by the preceding definition of $f$.

To finish the construction, suppose we have a family of 2nd-order
Markov chains parameterized as

$$p_\theta(y_{1:L}) \propto \exp\left( \sum_{i=1}^{L} \theta^\top \phi_i(y_{i-2}, y_{i-1}, y_i) \right). \tag{4}$$

Since $\phi_i$ depends on the history only through
$(y_{i-2}, y_{i-1})$, which can be determined from $c_{i-1}$, we can
define an equivalent function $\tilde{\phi}_i(c_{i-1}, y_i)$. Doing
so, we recover a model family equivalent to (4) after marginalizing
out $c_{1:L-1}$ (since $c_{1:L-1}$ is a deterministic function of
$y_{1:L}$, this last step is trivial).
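
The 2nd-order construction can be checked on a toy alphabet. The sketch below again assumes a wildcard-string representation of contexts (`*` for a position that is allowed to vary); the names `second_order_contexts` and `advance` are hypothetical and are used only for this illustration.

```python
from itertools import product


def second_order_contexts(alphabet, i):
    """Context set C_i for a 2nd-order chain at position i >= 2."""
    prefix = "*" * (i - 2)
    contexts = {prefix + a + b for a, b in product(alphabet, repeat=2)}
    contexts.add("*" * i)  # catch-all context, included only for coverage
    return contexts


def advance(c, y):
    """f_{i+1}(c_i, y_{i+1}): forget one more position, keep (y_i, y_{i+1})."""
    return "*" * (len(c) - 1) + c[-1] + y


contexts_3 = second_order_contexts("abc", 3)
contexts_4 = second_order_contexts("abc", 4)
```

For an alphabet of size V = 3 this gives V^2 = 9 precise contexts plus the catch-all, and advancing any precise context at position 3 lands in a precise context at position 4, never in the catch-all.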


4. Adaptive Context Sets 

The previous section shows how to define a tractable model for any
collection of canonical context sets $\mathcal{C}_i$. We now show how
to choose such sets adaptively, at run-time. We use a heuristic
motivated by beam search, which greedily chooses the highest-scoring
configurations of $y_{1:i}$ based on an estimate of their mass. We
work one level of abstraction higher, choosing contexts instead of
configurations; this allows us to maintain coverage while still being
adaptive.

Our idea has already been illustrated to an extent in Figures 1 and
2: if some of our contexts are very coarse (such as
$\star\star\star$ in Figure 2) and others are much finer (such as
$abc$ in Figure 2), then we can achieve coverage of the space while
still modeling complex dependencies. We will do this by allowing each
context $c \in \mathcal{C}_i$ to track a suffix of $y_{1:i}$ of
arbitrary length; this contrasts with the Markov chain example where
we always track suffixes of length 2.

Precise contexts expose more information about $y_{1:i}$ and so allow
more accurate modeling; however, they are small and $\mathcal{C}_i$
is necessarily limited in size, so only a small part of the space can
be precisely modeled in this way. We thus want to choose contexts
that focus on high probability regions.


Our procedure. To do this, we define the partial model

$$q_\theta(y_{1:i}, c_{1:i-1}) \propto \exp\left( \sum_{j=1}^{i} \theta^\top \phi_j(c_{j-1}, y_j) \right) \times \kappa(y, c). \tag{5}$$

We then define the context sets inductively via the following
procedure, which takes as input a context size $B$:

• Let $\tilde{\mathcal{C}}_i = \{c_{i-1} \times \{y_i\} \mid c_{i-1} \in \mathcal{C}_{i-1},\ y_i \in \mathcal{Y}_i\}$.

• Compute the mass of each element of $\tilde{\mathcal{C}}_i$ under $q_\theta$.

• Let $\mathcal{C}_i$ be the $B$ elements of $\tilde{\mathcal{C}}_i$
with highest mass, together with the set $\mathcal{Y}_{1:i}$.

The remaining elements of $\tilde{\mathcal{C}}_i$ will effectively be
merged into their least ancestor in $\mathcal{C}_i$. Note that each
$c \in \mathcal{C}_i$ fixes the value of some suffix $y_{j:i}$ of
$y_{1:i}$, and allows $y_{1:j-1}$ to vary freely across
$\mathcal{Y}_{1:j-1}$. Any such collection will automatically satisfy
the closure property.

The above procedure can be performed during the forward pass of
inference, and so is cheap computationally. Implementation details
can be found in the appendix. We call this procedure RCMS (short for
“Reified Context Model Selection”).

Caveat. There is no direct notion of an inference error in the above
procedure, since exact inference is possible by design. An indirect
notion of inference error is poor choice of contexts, which can lead
to less accurate predictions.

4.1. Relationship to beam search

The idea of greedily selecting contexts based on $q_\theta$ is
similar in spirit to beam search, an approximate inference algorithm
that greedily selects individual values of $y_{1:i}$ based on
$q_\theta$. More formally, beam search maintains a beam
$B_i \subseteq \mathcal{Y}_{1:i}$ of size $B$, constructed as follows:

• Let $\tilde{B}_i = B_{i-1} \times \mathcal{Y}_i$.

• Compute the mass of each element of $\tilde{B}_i$ under $q_\theta$.

• Let $B_i$ be the $B$ elements of $\tilde{B}_i$ with highest mass.

The similarity can be made precise: beam search is a degenerate
instance of our procedure. Given $B_i$, let
$\mathcal{C}_i = \{\{b\} \mid b \in B_i\} \cup \{\mathcal{Y}_{1:i}\}$.
Then $\mathcal{C}_i$ consists of singleton sets for each element of
$B_i$, together with $\mathcal{Y}_{1:i}$ in order to ensure coverage.
To get back to beam search (which doesn’t have coverage), we add an
additional feature to our model: $\mathbb{I}[c_i = \mathcal{Y}_{1:i}]$.
We set the weight of this feature to $-\infty$, assigning zero mass
to everything outside of $B_i$.

Given any algorithm based on beam search, we can improve it simply by
allowing the weight on this additional feature to be learned from
data. This can help with the precision ceiling issue (Section 5) by
allowing us to reason about when beam search is likely to have made a
search error.

4.2. Featurizations

We end this section with a recipe for choosing features
$\phi_i(c_{i-1}, y_i)$. We focus on n-gram and alignment features,
which are what we use in our experiments.


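n-gram features. We consider nth-order Markov chains over text,
typically featurized by (n + 1)-grams:

$$\phi_i(y_{1:i-1}, y_i) = \left( \mathbb{I}[y_{i-n:i} = \hat{y}] \right)_{\hat{y} \in \mathcal{Y}_{i-n:i}}.$$

To extend this to our setting, define
$\bar{\mathcal{Y}}_i = \mathcal{Y}_i \cup \{\star\}$ and
$\bar{\mathcal{Y}}_{1:i} = \prod_{j=1}^{i} \bar{\mathcal{Y}}_j$. We
can identify each pair $(c_{i-1}, y_i)$ with a sequence
$s = \sigma(c_{i-1}, y_i) \in \bar{\mathcal{Y}}_{1:i}$ in the same way
as before: in each position $j \le i$ where $y_j$ is determined by
$(c_{i-1}, y_i)$, $s_j = y_j$; otherwise, $s_j = \star$. We then
define our n-gram model on the extended space:

$$\phi_i(c_{i-1}, y_i) = \left( \mathbb{I}[\sigma(c_{i-1}, y_i)_{i-n:i} = \hat{y}] \right)_{\hat{y} \in \bar{\mathcal{Y}}_{i-n:i}}. \tag{6}$$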
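
A minimal sketch of the extended n-gram featurization follows, assuming contexts are stored as wildcard strings (with `*` standing in for the wildcard symbol) so that the sequence s is simply the context followed by the new symbol. The function names and the sparse dictionary of weights `theta` are assumptions for illustration, not part of the paper.

```python
def ngram_key(c_prev, y, n):
    """Key of the single active (n+1)-gram indicator for the pair (c_prev, y).

    The last n+1 symbols of the extended sequence are kept; near the start of
    the sequence the key is simply shorter.
    """
    s = c_prev + y
    return s[-(n + 1):]


def ngram_score(theta, c_prev, y, n):
    """Inner product of the weights with the indicator feature vector."""
    return theta.get(ngram_key(c_prev, y, n), 0.0)


# Hypothetical weights, including one for an n-gram with a wildcard
theta = {"bca": 1.5, "**a": -0.2}
```

Because wildcards are ordinary symbols of the extended alphabet, a coarse context such as `**` receives its own weights, which lets the model learn how to score histories it has chosen not to remember.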

Alignments. In the speech task from Section 2, we have an input
$x_{1:L'}$ and output $y_{1:L}$, where $x$ and $y$ have different
lengths and need to be aligned. To capture this, we add an alignment
$z$ to the model, such as the one below:

[alignment figure not reproduced]

We represent $z$ as a bipartite graph between $\{1, \ldots, L\}$ and
$\{1, \ldots, L'\}$ with no crossing edges, and where every node has
degree at least one. The non-crossing condition allows one phoneme to
align to multiple characters, or one character to align to multiple
phonemes, but not both. Our goal is to model $p_\theta(y, z \mid x)$.

To featurize alignment models, we place n-gram features on the output
$y_i$, as well as on every group of n consecutive edges. In addition,
we augment the context $c_i$ to keep track of what $y_i$ most
recently aligned to (so that we can ensure the alignment is
monotonic). We also maintain the $B$ best contexts at position $i$
separately for each of the $L'$ possible values of $z_i$; this
modification to the RCMS heuristic encourages even coverage of the
space of alignments.
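
To make the context-selection procedure of this section concrete, here is a minimal sketch of one selection step without alignments. It is not the authors' implementation (the appendix is the reference for that); it assumes contexts are wildcard strings, masses are unnormalized forward masses, and `score(c_prev, y)` is a hypothetical stand-in for the inner product of the weights with the features.

```python
import math


def contains(p, q):
    """True if wildcard pattern p contains pattern q (equal lengths assumed)."""
    return all(a == "*" or a == b for a, b in zip(p, q))


def rcms_step(prev, alphabet, score, B):
    """One step of context selection.

    prev maps each context at position i-1 to its unnormalized mass.
    Returns the same kind of map for position i with at most B + 1 contexts.
    """
    # Extend every context by every symbol and compute its mass
    extended = {}
    for c, m in prev.items():
        for y in alphabet:
            extended[c + y] = extended.get(c + y, 0.0) + m * math.exp(score(c, y))
    length = len(next(iter(extended)))
    # Keep the B heaviest extensions plus the catch-all context
    top = sorted(extended, key=extended.get, reverse=True)[:B]
    kept = set(top) | {"*" * length}
    # Merge every extension into its most specific kept ancestor
    merged = {c: 0.0 for c in kept}
    for c, m in extended.items():
        ancestor = max((k for k in kept if contains(k, c)),
                       key=lambda k: sum(ch != "*" for ch in k))
        merged[ancestor] += m
    return merged


def toy_score(c_prev, y):
    # Hypothetical score that simply prefers the symbol "a"
    return 1.0 if y == "a" else 0.0


step1 = rcms_step({"": 1.0}, "ab", toy_score, B=1)
step2 = rcms_step(step1, "ab", toy_score, B=1)
```

In this toy run the total mass after two steps equals (e + 1)^2, the sum over all four length-two sequences, even though only one precise context is kept: the discarded extensions are merged into the catch-all rather than dropped, which is the difference from beam search.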

5. Generating High Precision Predictions 

Recall that one symptom stemming from a lack of coverage 
is poor estimates of uncertainty and the inability to generate 
high precision predictions. In this section, we show that the 
coverage offered by RCMS mitigates this issue compared 
to beam search. 

Specifically, we are interested in whether an algorithm can find a
large subset of test examples that it can classify with high
(≈ 99%) accuracy. Formally, assume a method outputs a prediction $y$
with confidence $p \in [0, 1]$ for each example. We sort the examples
by confidence, and see what fraction $R$ of examples we can answer
before our accuracy drops below a given threshold $P$. In this case,
$P$ is the precision and $R$ is the recall.
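
The recall-at-precision metric can be sketched as follows. One detail is an assumption on our part: the sketch reports the largest confidence-sorted prefix whose accuracy is at least the threshold (the usual reading of a precision-recall curve), rather than stopping at the first drop below it. The function name and toy data are hypothetical.

```python
def recall_at_precision(confidences, correct, threshold):
    """Largest fraction of examples, taken in order of decreasing confidence,
    whose accuracy is at least `threshold`. Assumes a non-empty input."""
    order = sorted(range(len(confidences)),
                   key=lambda k: confidences[k], reverse=True)
    hits, best = 0, 0
    for rank, k in enumerate(order, start=1):
        hits += int(correct[k])
        if hits / rank >= threshold:
            best = rank
    return best / len(order)


# Toy example: the third most confident prediction is wrong
conf = [0.9, 0.8, 0.7, 0.6]
right = [True, True, False, True]
```

With these toy values, the recall at P = 0.99 is 0.5 (only the two most confident predictions can be answered), while at P = 0.75 it is 1.0.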

Having good recall at high levels of precision (e.g., P = 
0.99) is useful in applications where we need to pass on 
predictions below the precision threshold for a human to 
verify, but where we would still like to classify as many 
examples as possible automatically. 

We ran an experiment on the word recognition dataset described in
Section 2. We used a 4-gram model, training both beam search (with a
beam size of 10) and RCMS (with 10 contexts per position, not
counting $\mathcal{Y}_{1:i}$). In addition, we used beam search with
a beam size of 200 to simulate almost-exact inference. To train the
models, we maximized the approximate log-likelihood using AdaGrad
(Duchi et al., 2010) with a step size $\eta$ of 0.2 and
$\delta = 10^{-4}$.

The precision-recall curve for each method is plotted in 
Figure 3; confidence is the probability the model assigns 
to the predicted output. Note that while beam search and 
RCMS achieve similar accuracies (precision at R = 1) on 
the full test set (87.1% and 88.5%, respectively), RCMS 
is much better at separating out examples that are likely to 



Figure 3. On word recognition, precision-recall curve of beam 
search with beam size 10, RCMS with 10 contexts per position, 
and almost-exact inference simulated by beam search with a beam 
of size 200. Beam search makes errors even on its most confident 
predictions, while RCMS is able to separate out a large number 
of nearly error-free predictions. 

be correct. The flat region in the precision-recall curve for beam
search means that the model confidence and actual error probability
are unrelated across that region.

As a result, there is a precision ceiling, where it is simply 
impossible to obtain high precision at any reasonable level 
of recall. To quantify this effect, note that the recall at 99% 
precision for beam search is only 16%, while for RCMS 
it is 82%. For comparison, the recall for exact inference 
is only 4% higher (86%). Therefore, RCMS is nearly as 
effective as exact inference on this metric while requiring 
substantially fewer computational resources. 

6. Learning with Indirect Supervision 

The second symptom of lack of coverage is the inability to learn from
indirect supervision. In this setting, we have an exponential family
model $p_\theta(y, z \mid x) \propto \exp(\theta^\top \phi(x, y, z))$,
where $x$ and $y$ are observed during training but $z$ is unobserved.
The gradient of the (marginal) log-likelihood is:

$$\nabla \log p_\theta(y \mid x) = \mathbb{E}_{z \sim p_\theta(z \mid x, y)}[\phi(x, y, z)] - \mathbb{E}_{y', z' \sim p_\theta(y', z' \mid x)}[\phi(x, y', z')], \tag{7}$$

which is the difference between the expected features with respect
to the target distribution $p_\theta(z \mid x, y)$ and the model
distribution $p_\theta(y, z \mid x)$. In the fully supervised case,
where we observe $z$, the target term is simply $\phi(x, y, z)$,
which provides a clear training signal without any inference. With
indirect supervision, even obtaining a training signal requires
inference with respect to $p_\theta(z \mid x, y)$, which is generally
intractable.
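
The gradient identity (7) can be checked numerically on a model small enough to enumerate. Everything in the sketch below is a toy assumption: the output spaces `Y` and `Z`, the two-dimensional feature map `phi`, and the suppression of the input x.

```python
import math
from itertools import product

Y, Z = [0, 1], [0, 1, 2]


def phi(y, z):
    # Hypothetical feature map; the input x is suppressed
    return [float(y == z % 2), float(z)]


def score(theta, y, z):
    return sum(t * f for t, f in zip(theta, phi(y, z)))


def log_marginal(theta, y):
    """log p(y) = log sum_z exp(score) - log sum_{y', z'} exp(score)."""
    num = sum(math.exp(score(theta, y, z)) for z in Z)
    den = sum(math.exp(score(theta, yy, zz)) for yy, zz in product(Y, Z))
    return math.log(num) - math.log(den)


def expected_phi(theta, pairs):
    """Expected features under the model restricted to the given (y, z) pairs."""
    weights = [math.exp(score(theta, y, z)) for y, z in pairs]
    total = sum(weights)
    return [sum(w * phi(y, z)[d] for w, (y, z) in zip(weights, pairs)) / total
            for d in range(2)]


def gradient(theta, y):
    """Target expectation minus model expectation, as in (7)."""
    target = expected_phi(theta, [(y, z) for z in Z])
    model = expected_phi(theta, list(product(Y, Z)))
    return [t - m for t, m in zip(target, model)]
```

Comparing `gradient` against a central finite difference of `log_marginal` confirms the identity on this toy model. In realistic models both expectations range over exponentially many configurations, which is where the choice of inference procedure matters.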

In the context of beam search, there are several strategies for
inferring $z$ when computing gradients:

• Select-by-model: select beams based on $p_\theta(z \mid x)$, then
re-weight at the end by $p_\theta(y \mid z, x)$. This only works if
the weights are high for at least some “easy” examples, from which
learning can then bootstrap.

• Select-by-target: select beams based on $p_\theta(z \mid x, y)$.
Since $y$ is not available at test time, parameters $\theta$ learned
conditioned on $y$ do not generalize well.

• Hybrid: take the union of beams based on both the model and target
distributions.

• Forced decoding (Gorman et al., 2011): first train a simple model
for which exact inference is tractable to infer the most likely $z$,
conditioned on $x$ and $y$. Then simply fix $z$; this becomes a
fully-supervised problem.

To understand the behavior of these methods, we used them all to
train a model on the speech recognition dataset from Section 2. The
model places 5-gram indicator features on the output as well as on
the alignments. We trained using AdaGrad with step size $\eta = 0.2$
and $\delta = 10^{-4}$. For each method, we set the beam size to 20.
For forced decoding, we used a bigram model with exact inference to
impute $z$ at the beginning.

The results are shown in Figure 4(a). Select-by-model doesn’t learn
at all: it only finds valid alignments for 2 out of the 746 training
examples; for the rest, $p_\theta(y \mid z, x)$ is zero for all
alignments considered, thus providing no signal for learning.
Select-by-target quickly reaches high training accuracy, but
generalizes extremely poorly because it doesn’t learn to keep the
right answer on the beam. The hybrid approach does better but still
not very well. The only method that learns effectively is forced
decoding.

While forced decoding works well, it relies on the idea that a simple
model can effectively determine $z$ given access to $x$ and $y$. This
will not always be the case, so we would like methods that work well
even without such a model. Reified context models provide a natural
way of doing this: we simply compute $p_\theta(z \mid x, y)$ under the
contexts selected by RCMS, and perform learning updates in the
natural way.

To test RCMS, we trained it in the same way using 20 con¬ 
texts per position. Without any need for an initialization 
scheme, we obtain a model whose test accuracy is better 
than that of forced decoding (see Figures 4(b),4(c)). 


Decipherment: Unsupervised Learning. We now turn our attention to an
unsupervised problem: the decipherment task from Section 2. We model
decipherment as a hidden Markov model (HMM): the hidden plain text
evolves according to an n-th order Markov chain, and the cipher text
is emitted based on a deterministic but unknown 1:1 substitution
cipher (Ravi & Knight, 2009).


All the methods we described for speech recognition break down in the
absence of any supervision except select-by-model. We therefore
compare only three methods: select-by-model (beam search), RCMS, and
exact inference. We trained a 1st-order (bigram) HMM using all 3
methods, and a 2nd-order (trigram) HMM using only beam search and
RCMS, as exact inference was too slow (the vocabulary size is 500).
We used the given plain text to learn the transition probabilities,
using absolute discounting (Ney et al., 1994) for smoothing. Then, we
used EM to learn the emission probabilities; we used Laplace
smoothing for these updates.

The results are shown in Figure 5. We measured performance by mapping
accuracy: the fraction of unique symbols that are correctly mapped
(Nuhn et al., 2013). First, we compared the overall accuracy of all
methods, setting the beam size and context size both to 60. We see
that all 2nd-order models outperform all 1st-order models, and that
beam search barely learns at all for the 1st-order model.
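
Mapping accuracy as described above is straightforward to compute once the predicted and true substitution tables are available. The sketch below assumes both are dictionaries from cipher symbol to plaintext word; the function name and the toy tables are hypothetical.

```python
def mapping_accuracy(predicted, gold):
    """Fraction of unique cipher symbols whose predicted plaintext is correct.

    Symbols missing from `predicted` count as errors. Assumes `gold` is
    non-empty.
    """
    correct = sum(predicted.get(symbol) == word for symbol, word in gold.items())
    return correct / len(gold)


gold = {"13": "I", "5": "am", "54": "what"}
guess = {"13": "I", "5": "what", "54": "am"}
accuracy = mapping_accuracy(guess, gold)
```

Note that this metric weights every symbol type equally, so a rare symbol counts as much as a frequent one.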

Restricting attention to 2nd-order models, we measure the effect of
beam size and context size on accuracy, plotting learning curves for
sizes of 10, 20, 30, and 60. In all cases, RCMS learns more quickly
and converges to a more accurate solution than beam search. The
shapes of the learning curves are also different: RCMS learns quickly
after a few initial iterations, while beam search slowly accrues
information at a roughly constant rate over time.

7. Refinement of Contexts During Training 

When learning with indirect supervision and approximate inference,
one intuition is that we can “bootstrap” by first learning from easy
examples, and then using the information gained from these examples
to make better inferences about the remaining ones (Liang et al.,
2011). However, this can fail if there are insufficiently many easy
examples (as in the speech task), if the examples are hard to
identify, or if they differ statistically from the remaining
examples.

We think of the above as “vertical bootstrapping”: using the full
model on an increasing number of examples. RCMS instead performs
“horizontal bootstrapping”: for each example, it selects a model (via
the context sets) based on the information available. As training
progresses, we expect these contexts to become increasingly fine as
our parameters improve.

To measure this quantitatively, we define the length of a context
$c_{i-1}$ to be the number of positions of $y_{1:i-1}$ that can be
determined from $c_{i-1}$ (number of non-$\star$’s). We plot the
average length (weighted by mass under $q_\theta$) as training
progresses. The averages are updated every 50 and 100 training
examples respectively for word and speech recognition. For
decipherment, they are computed once for each

Figure 4. Left: character error rate (CER) of all beam search-based methods on the speech task, for 5 passes of the training data; note
that an empty output always has a CER of 1.0. Middle: CER of forced decoding and RCMS over 5 random permutations of the data;
the solid line is the median. Right: exact-match accuracy over the same 5 permutations.

Figure 5. Results on the decipherment task. Left: accuracy for a fixed beam/context size as the model order varies; approximate inference
with a 2nd-order HMM using RCMS outperforms both beam search in the same model and exact inference in a simpler model. Right:
effect of beam/context size on accuracy for the 2nd-order HMM. RCMS is much more robust to changes in beam/context size.


full pass over the training data (since EM only updates the 
parameters once per pass). 
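
The mass-weighted average context length can be sketched as below, assuming contexts are stored as wildcard strings (with `*` for an undetermined position) together with their masses. The function names and the example masses are hypothetical.

```python
def context_length(context):
    """Number of positions that the context determines (non-wildcards)."""
    return sum(ch != "*" for ch in context)


def average_context_length(masses):
    """Average context length, weighted by the mass of each context.

    `masses` maps each context to a non-negative mass; the total is assumed
    to be positive.
    """
    total = sum(masses.values())
    return sum(m * context_length(c) for c, m in masses.items()) / total


# Toy example: half of the mass sits on a fully specified context
masses = {"**": 1.0, "*a": 1.0, "ba": 2.0}
avg = average_context_length(masses)
```

Here the average is (0 * 1 + 1 * 1 + 2 * 2) / 4 = 1.25; as parameters sharpen and more mass moves to precise contexts, this quantity grows.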

Figure 6 shows that the broad trend is an increase in the context
length over time. For both the word and speech tasks, there is an
initial overshoot at the beginning that is not present in the
decipherment task; this is because the word and speech tasks are
trained with stochastic gradient methods, which often overshoot and
then correct in parameter space, while for decipherment we use the
more stable EM algorithm.

Since we start by using coarse contexts and move to finer 
contexts by the end of training, RCMS can be thought of 
as a coarse-to-fine training procedure (Petrov & Charniak, 
2011). However, instead of using a pre-defined, discrete 
set of models for initialization, we organically adapt the 
amount of context on a per-example basis. 

8. Related work 

Kulesza & Pereira (2007) first study the interaction between
approximate inference and learning, showing that even in the fully
supervised case approximate inference can be seriously detrimental;
Finley & Joachims (2008) show that approximate inference algorithms
which overgenerate possible outputs interact best with learning; this
further supports the need for coverage when learning.

Four major approaches have been taken to address the problem of
learning with inexact inference. The first modifies the learning
updates to account for the inference procedure, as in the
max-violation perceptron and related algorithms (Huang et al., 2012;
Zhang et al., 2013; Yu et al., 2013); reinforcement learning
approaches to inference (Daume III et al., 2009; Shi et al., 2015)
also fit into this category. Another approach modifies the inference
algorithm to obtain better coverage, as in coarse-to-fine inference
(Petrov et al., 2006; Weiss et al., 2010), where simple models are
used to direct the focus of more complex models. Pal et al. (2006)
encourage coverage for beam search by adaptively increasing the beam
size. A third approach is to use inference procedures with
certificates of optimality, based on either

Figure 6. Average context length vs. number of learning updates for the word recognition, speech, and decipherment tasks. For word
and speech recognition we take a cumulative average (to reduce noise).

duality gaps from convex programs (Sontag, 2010) or variational
bounds (Xing et al., 2002; Wainwright et al., 2005).

Finally, another way of sidestepping the problems of approximate
inference is to learn a model that is already tractable. While
classical tractable model families based on low treewidth are often
insufficiently expressive, more modern families have shown promise;
for instance, sum-product networks (Poon & Domingos, 2011) can
express models with high treewidth while still being tractable, and
have achieved state-of-the-art results for some tasks. Other work
includes exchangeable variable models (Niepert & Domingos, 2014) and
mean-field networks (Li & Zemel, 2014).

Our method RCMS also attempts to define tractable model
families, in our case via a parsimonious choice of latent
context variables, even though the actual distribution over
$y_{1:L}$ may have arbitrarily high treewidth. We adaptively
choose the model structure for each example at “run-time”,
which distinguishes our approach from the aforementioned
methods, though sum-product networks have some capacity for
expressing adaptivity implicitly. We believe that such
per-example adaptivity is important for obtaining good
performance on challenging structured prediction tasks.

Certain smoothing techniques in natural language processing
also interpolate between contexts of different order, such as
absolute discounting (Ney et al., 1994) and Kneser-Ney
smoothing (Kneser & Ney, 1995). However, in such cases all
observed contexts are used in the model; to get the same
tractability gains as we do, it would be necessary to adaptively
sparsify the model for each example at run-time. Some Bayesian
nonparametric approaches such as infinite contingent Bayesian
networks (Milch et al., 2005) and hierarchical Pitman-Yor
processes (Teh, 2006; Wood et al., 2009) also reason about
contexts; again, such models do not lead to tractable inference.


9. Discussion 

We have presented a new framework, reified context models,
that reifies context as a random variable, thereby defining a
family of expressive but tractable probability distributions.
By adaptively choosing context sets at run-time, our RCMS
method uses short contexts in regions of high uncertainty and
long contexts in regions of low uncertainty, thereby reproducing
the behavior of coarse-to-fine training methods in a more
organic and fine-grained manner. In addition, because RCMS
maintains full coverage of the space, it is able to break
through the precision ceiling faced by beam search. Coverage
also helps with training under indirect supervision, since we
can better identify settings of latent variables that assign
high likelihood to the data.

At a high level, our method provides a framework for
structuring inference in terms of the contexts it considers;
because the contexts are reified in the model, we can also
support queries about how much probability mass lies in each
context. These two properties together open up intriguing
possibilities. For instance, one could imagine a multi-pass
approach to inference where the first pass uses small context
sets for each location, and later passes add additional
contexts at locations where there is high uncertainty. By
adaptively adding context only when it is needed, we could
speed up inference by a potentially large amount.

Another direction of research is to extend our construction
beyond a single left-to-right ordering. In principle, we can
consider any collection of contexts that induce a graphical
model with low treewidth, rather than only considering the
factorization in (3). For problems such as image segmentation
where the natural structure is a grid rather than a chain,
such extensions may be necessary.

Finally, while we currently learn how much weight to assign
to each context, we could go one step further and learn
which contexts to propose and include in the context sets $C_i$
(rather than relying on a fixed procedure as in the RCMS
algorithm). Ideally, one could specify a large number of
possible strategies for building context sets, and the best
strategy to use for a given example would be learned from
data. This would move us one step closer to being able to
employ arbitrarily expressive models with the assurance of
an automatic inference procedure that can take advantage
of the expressivity in a reliable manner.

A. Implementation Details

Recall that to implement the RCMS method, we need to 
perform the following steps: 

1. Let $\tilde{C}_i = \{c_{i-1} \times \{y_i\} \mid c_{i-1} \in C_{i-1},\ y_i \in \mathcal{Y}_i\}$.

2. Compute what the mass of each element of $\tilde{C}_i$ would
be under the current model if $\tilde{C}_i$ were used as the
collection of contexts.

3. Let $C_i$ be the $B$ elements of $\tilde{C}_i$ with highest mass,
together with the set $\mathcal{Y}_{1:i}$.

As in Section 4, each context in $C_i$ can be represented by a
string $s_{1:i}$, where $s_j \in \mathcal{Y}_j \cup \{\star\}$. We will also
assume an arbitrary ordering on $\mathcal{Y}_j \cup \{\star\}$ that has
$\star$ as its maximum element.

In addition, we use two datatypes: E (for “expand”), which
keeps track of elements of $\tilde{C}_i$, and M (for “merge”), which keeps
track of elements of $C_i$. More precisely, if $c_{i-1}$ is represented
by an object $m_{i-1}$ of type M, then $E(m_{i-1}, y_i)$ represents
$c_{i-1} \times \{y_i\}$, and $M(E(m_{i-1}, y_i))$ represents $c_{i-1} \times \{y_i\}$
as well, with the distinction that it is a member of $C_i$ rather
than $\tilde{C}_i$. The distinction is important because we will also want
to merge smaller contexts into objects of type M. For both E and M
objects, we maintain a field len, which is the length of the suffix
of $y_{1:i}$ that is specified (e.g., if an object represents
$\mathcal{Y}_{1:3} \times \{y_{4:5}\}$, then its len is 2).

Throughout our algorithm, we will maintain 2 invariants:

• $C_i$ and $\tilde{C}_i$ will be sorted lexicographically (e.g., based first
on $s_i$, then $s_{i-1}$, etc.).

• A list $\mathrm{lcs}_i$ of length len($C_i$) is maintained, such
that the longest common suffix of $C_i[a]$ and $C_i[b]$ is
$\min_{c \in [a,b)} \mathrm{lcs}_i[c]$. A similar list $\widetilde{\mathrm{lcs}}_i$ is maintained for $\tilde{C}_i$.
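
The second invariant is the usual longest-common-prefix array property, applied to suffixes. The sketch below checks it on a toy context set. The string encoding (with `*` standing for the wildcard), the helper names, and the trailing sentinel entry are assumptions made for illustration; they are not taken from the authors' code.

```python
STAR = "*"  # assumed encoding of the wildcard symbol

def sort_key(s):
    # Compare the last symbol first, then the one before it, and so on;
    # the wildcard sorts after every ordinary symbol.
    return [(ch == STAR, ch) for ch in reversed(s)]

def common_suffix(a, b):
    # Length of the longest common suffix of two equal-length strings.
    n = 0
    while n < len(a) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def build_lcs(contexts):
    # lcs[c] compares contexts[c] with contexts[c + 1]; the final 0 is a
    # sentinel so that the list has the same length as `contexts`.
    lcs = [common_suffix(contexts[c], contexts[c + 1])
           for c in range(len(contexts) - 1)]
    return lcs + [0]

def lcs_between(lcs, a, b):
    # Longest common suffix of contexts[a] and contexts[b], for a < b.
    return min(lcs[a:b])

contexts = sorted(["**ab", "*cab", "***b", "**aa", "****"], key=sort_key)
lcs = build_lcs(contexts)
```

Because the list is sorted, any two contexts that share a long suffix are separated only by contexts sharing at least that suffix, which is why a range minimum over adjacent entries suffices.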

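Step 1. To perform step 1 above, we just do:

    $\tilde{C}_i$ = []
    for j = 0 to len($\mathcal{Y}_i$) - 1 do
        for k = 0 to len($C_{i-1}$) - 1 do
            if k + 1 < len($C_{i-1}$) then
                $\widetilde{\mathrm{lcs}}_i$.append($\mathrm{lcs}_{i-1}$[k] + 1)
            else
                $\widetilde{\mathrm{lcs}}_i$.append(0)
            end if
            $\tilde{C}_i$.append(E($C_{i-1}$[k], $\mathcal{Y}_i$[j]))
        end for
    end for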

The important observation is that if two sequences end in the same
character, their lcs is one greater than the lcs of the remaining
sequence without that character; and if they end in different
characters, their lcs is 0.

Each E keeps track of a forward score, defined as

$E(m, y).\mathrm{forward} = m.\mathrm{forward} \times \exp(\theta^\top \phi(m, y))$. (8)
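
A minimal Python sketch of Step 1 follows, assuming contexts are stored as strings with `*` for the wildcard. The `Node` class, its field `length` (playing the role of the len field), and the callable `log_potential` (standing in for $\theta^\top \phi(m, y)$) are hypothetical names introduced for this illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Node:
    s: str          # context string; "*" marks an unspecified position
    length: int     # number of specified trailing symbols (the len field)
    forward: float  # unnormalized forward score

def expand(prev_nodes, prev_lcs, labels, log_potential):
    """Build the expanded contexts and their lcs list from position i - 1."""
    nodes, lcs = [], []
    for y in labels:  # labels must be given in the assumed symbol order
        for k, m in enumerate(prev_nodes):
            if k + 1 < len(prev_nodes):
                lcs.append(prev_lcs[k] + 1)  # same last symbol: lcs grows by 1
            else:
                lcs.append(0)  # the next entry ends in a different symbol
            score = m.forward * math.exp(log_potential(m, y))  # equation (8)
            nodes.append(Node(m.s + y, m.length + 1, score))
    return nodes, lcs

# Toy input: two contexts at position 1, already sorted with "*" last.
prev = [Node("a", 1, 0.5), Node("*", 0, 1.0)]
nodes, lcs = expand(prev, [0, 0], "ab", lambda m, y: 0.0)
```

Iterating over labels in the outer loop keeps the output sorted by last symbol first, so no explicit sort is needed after expansion.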


Step 2. For step 2, we find the $B$ elements $c$ of $\tilde{C}_i$ with the
largest forward score; we set a flag c.active to true for each such $c$.
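
As a small illustration of Step 2, the selection can be done without disturbing the lexicographic order of the list. The function name `mark_active` and the use of a plain list of scores are assumptions of this sketch.

```python
import heapq

def mark_active(forward_scores, B):
    """Return a list of flags, True for the B largest forward scores."""
    top = heapq.nlargest(B, range(len(forward_scores)),
                         key=lambda idx: forward_scores[idx])
    active = [False] * len(forward_scores)
    for idx in top:
        active[idx] = True
    return active

flags = mark_active([1.0, 3.0, 2.0, 0.5, 0.5, 3.0], B=3)
```

Selecting indices rather than sorting the contexts themselves preserves the sorted order that Step 3 relies on.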

Step 3. Step 3 contains the main algorithmic challenge, which is
to efficiently merge each element of $\tilde{C}_i$ into its least ancestor in $C_i$.
If we think of $C_i$ as a tree (as in Figure 2), we can do this by
essentially performing a depth-first search of the tree. The DFS goes
backwards in the lexicographic ordering, so we need to reverse the
lists $C_i$ and $\mathrm{lcs}_i$ at the end.

    ▷ merge and update lcs
    stack = []
    $\mathrm{lcs}_i$ = []
    $\ell \leftarrow \infty$
    for j = len($\tilde{C}_i$) - 1 to 0 do
        $\ell \leftarrow \min(\ell, \widetilde{\mathrm{lcs}}_i[j])$
        while $\ell$ < stack[-1].len do
            ▷ then current top of stack is not an ancestor of $\tilde{C}_i$[j]
            stack.pop()
        end while
        if $\tilde{C}_i$[j].active then
            m = M($\tilde{C}_i$[j])
            $\mathrm{lcs}_i$.append($\ell$)
            $C_i$.append(m)
            stack.push(m)
            $\ell \leftarrow \infty$
        else
            ▷ merge $\tilde{C}_i$[j] into its least ancestor
            stack[-1].absorb($\tilde{C}_i$[j])
        end if
    end for
    $\mathrm{lcs}_i$.reverse()
    $C_i$.reverse()

If $m \in C_i$ has absorbed elements $e_1, \ldots, e_k$, then we compute
m.forward as $\sum_{j=1}^{k} e_j.\mathrm{forward}$.
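
The merge can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the `Node` class and its field names are hypothetical, and the all-wildcard context is passed in explicitly as `root` and placed on the stack up front so that the stack is never empty.

```python
from dataclasses import dataclass

@dataclass
class Node:
    s: str            # context string; "*" marks an unspecified position
    length: int       # number of specified trailing symbols (the len field)
    forward: float    # unnormalized forward score
    active: bool = False

def merge(expanded, expanded_lcs, root):
    """Merge inactive expanded contexts into their least active ancestor."""
    stack = [root]
    merged, merged_lcs = [root], [0]  # root sorts last; 0 is a sentinel
    run_min = float("inf")            # lcs with the most recent active node
    for j in range(len(expanded) - 1, -1, -1):
        run_min = min(run_min, expanded_lcs[j])
        while run_min < stack[-1].length:
            stack.pop()               # top of stack is not an ancestor
        e = expanded[j]
        if e.active:
            m = Node(e.s, e.length, e.forward)
            merged_lcs.append(run_min)
            merged.append(m)
            stack.append(m)
            run_min = float("inf")
        else:
            stack[-1].forward += e.forward  # absorb into least ancestor
    merged.reverse()
    merged_lcs.reverse()
    return merged, merged_lcs

expanded = [Node("aba", 3, 1.0), Node("*ba", 2, 3.0, True),
            Node("**a", 1, 2.0, True), Node("abb", 3, 0.5),
            Node("*bb", 2, 0.5), Node("**b", 1, 3.0, True)]
merged, merged_lcs = merge(expanded, [2, 1, 0, 2, 1, 0], Node("***", 0, 0.0))
```

In this toy run, `aba` is absorbed by `*ba`, while `abb` and `*bb` are both absorbed by `**b`, so the total forward mass is preserved across the merge.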

After we have constructed $C_1, \ldots, C_L$, we also need to send
backward messages for inference. If $e \in \tilde{C}_i$ is merged into
$m \in C_i$, then e.backward = m.backward. If $m \in C_i$
expands to $E(m, y)$ for $y \in \mathcal{Y}_{i+1}$, then
$m.\mathrm{backward} = \sum_{y \in \mathcal{Y}_{i+1}} E(m, y).\mathrm{backward} \times \exp(\theta^\top \phi(m, y))$.
The (unnormalized) probability mass of an object is then simply the
product of its forward and backward scores; we can compute the
normalization constant by summing over $C_i$.
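
The backward recursion can be illustrated on a hand-built three-layer example. The classes `M` and `E` below mirror the two datatypes only loosely; their fields, the helper `expand_into`, and the scalar `potential` (standing in for $\exp(\theta^\top \phi(m, y))$) are assumptions of this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class M:
    forward: float = 0.0
    backward: float = 1.0  # contexts at the last position keep 1.0
    expansions: list = field(default_factory=list)  # (E object, potential)

@dataclass
class E:
    forward: float
    merged_into: M
    backward: float = 0.0

def expand_into(m, potential, target):
    # Expand m by one symbol and merge the result into `target`.
    e = E(m.forward * potential, target)
    target.forward += e.forward
    m.expansions.append((e, potential))

def backward_pass(layers):
    # Visit every layer except the last, from right to left.
    for layer in reversed(layers[:-1]):
        for m in layer:
            total = 0.0
            for e, potential in m.expansions:
                e.backward = e.merged_into.backward
                total += e.backward * potential
            m.backward = total

def normalizer(layer):
    return sum(m.forward * m.backward for m in layer)

start = M(forward=1.0)
a1, r1 = M(), M()
x2, r2 = M(), M()
expand_into(start, 2.0, a1)
expand_into(start, 1.0, r1)
expand_into(a1, 3.0, x2)
expand_into(a1, 1.0, r2)
expand_into(r1, 0.5, x2)
expand_into(r1, 2.0, r2)
layers = [[start], [a1, r1], [x2, r2]]
backward_pass(layers)
z_values = [normalizer(layer) for layer in layers]
```

Summing forward times backward gives the same value at every layer, which is a convenient sanity check on an implementation.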

In summary, our method can be coded in four steps. First, during
the forward pass of inference, we:

1. Expand to $\tilde{C}_i$ and construct $\widetilde{\mathrm{lcs}}_i$.

2. Sort by forward score and mark active nodes in $\tilde{C}_i$ for
inclusion in $C_i$.

3. Merge each node in $\tilde{C}_i$ into its least ancestor in $C_i$, using a
depth-first search.

Finally, once all of the $C_i$ are constructed, we perform the
backward pass:

4. Propagate backward messages and compute the normalization
constant.

B. Further Details of Experimental Setup

We include here a few experimental details that did not fit
into the main text. When training with AdaGrad, we performed
several stochastic gradient updates in parallel, similar to
the approach described in Recht et al. (2011) (although we
parallelized even more aggressively at the expense of
theoretical guarantees). We also used a randomized truncation
scheme to round most small coordinates of the gradients to
zero, which substantially reduces memory usage as well as
concurrency overhead.
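
The exact truncation scheme is not specified here. As a hedged illustration only, one common unbiased variant rounds each coordinate below a threshold either to zero or to the threshold itself, with probability chosen so that the expectation is unchanged. The function name, the `threshold` parameter, and the rounding rule are all assumptions of this sketch.

```python
import random

def randomized_truncate(grad, threshold, rng):
    """Sparsify a gradient while keeping each coordinate unbiased."""
    out = []
    for g in grad:
        if abs(g) >= threshold:
            out.append(g)  # large coordinates are kept as they are
        elif rng.random() < abs(g) / threshold:
            # Round up to +/- threshold with probability |g| / threshold.
            out.append(threshold if g > 0 else -threshold)
        else:
            out.append(0.0)  # otherwise round down to zero
    return out

rng = random.Random(0)
sparse = randomized_truncate([0.9, -0.02, 0.001, -0.5, 0.0],
                             threshold=0.1, rng=rng)
```

Under this rule a small coordinate $g$ becomes $\pm$threshold with probability $|g|/$threshold, so its expected value is still $g$, while most small coordinates are stored as exact zeros.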

For decipherment, we used absolute discounting with discount
0.25 and smoothing 0.01, and Laplace smoothing with parameter
0.01. For the 1st-order model, beam search performs better if
we use Laplace smoothing instead of absolute discounting
(though still worse than RCMS). In order to maintain a uniform
experimental setup, we excluded this result from the main text.

For the hybrid selection algorithm in the speech experiments,
we take the union of the beams at every step (as opposed to
computing two sets of beams separately and then taking a
single union at the end).

C. Additional Files 

In the supplementary material, we also include the source 
code and datasets for the decipherment task. A README is 
included to explain how to run these experiments. 



